The Problem

Many people are aware of Presidential and so-called “off-cycle” or midterm elections, but many don’t realize that Utah has elections every year. Voter identification in Presidential election years has long been fairly simple. It is somewhat harder to identify midterm election voters, but increasing campaign budgets have allowed even some down-ballot office-seekers to run rudimentary data mining operations. In odd years, cities hold non-partisan municipal elections. These alternate between two types: in one, the city elects a Mayor and some councilmembers; two years later, it elects the remaining members of the city council. The mayoral election, the “off-off cycle”, generally has the higher turnout of the two. The other, the “off-off-off cycle”, is lucky if anyone even knows it’s happening. 2019 was a non-mayoral city council race in Saratoga Springs.

Local elections are often plagued by a lack of interest or awareness. They also have minimal funds devoted to them and are often run without staff, primarily by the candidate and close friends who volunteer. Saratoga Springs is a city of roughly 35,000 residents, approximately 13,000 of whom are registered to vote. Municipal elections typically see 15-25% voter turnout, meaning that roughly 3,000 people can be expected to vote. The difference for a candidate between sending a mailer to 3,000 voters and sending one to 13,000 can amount to several thousand dollars. Since most candidates in Saratoga Springs have less than $3,000 to spend, effective and efficient use of money is paramount.
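
The turnout and mailer arithmetic above can be sketched in a few lines of R. The registration and turnout figures come from the text; the per-piece mailing cost is a hypothetical value chosen only for illustration.

```r
# Figures from the text: about 13,000 registered voters, 15-25% turnout.
registered <- 13000
turnout_range <- c(low = 0.15, high = 0.25)
expected_voters <- registered * turnout_range  # 1,950 to 3,250 voters

# ASSUMPTION: hypothetical all-in cost per mailer (printing plus postage).
cost_per_piece <- 0.50

cost_all <- registered * cost_per_piece   # mail every registered voter
cost_targeted <- 3000 * cost_per_piece    # mail only ~3,000 likely voters
savings <- cost_all - cost_targeted

print(expected_voters)
print(savings)
```

Under that assumed cost, targeting only likely voters would save more than a typical candidate's entire budget.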

The goal is therefore to accurately identify registered voters who are most likely to participate in the 2019 Saratoga Springs municipal election and to create a model that will allow prediction of future voter turnout as well.

Data Resources

The available data consists of public voter records obtained from the Utah County Clerk. There are 11,330 observations with 136 attributes, including some limited demographic information as well as voting history (whether they voted, not for whom). As noted above, there are more than 13,000 registered voters, so roughly 2,000 voters’ information is not contained in these records because they have opted to keep their information private. This poses an additional challenge to comprehensive identification. The data was anonymized and, in some instances, transformed to facilitate modelling. In addition, addresses were geocoded to latitude/longitude coordinates using the Google Maps API.

Data Exploration

The data was explored for erroneous values and outliers; a few instances were identified and excluded from the data set. Several attributes were discarded because they are the same for all observations, given the limited geographical area. One of the provided demographics is age. This was plotted, and a fairly normal distribution was observed, although the distribution for Saratoga Springs is slightly right-skewed. The age distribution was also explored to see if it differed markedly by voting precinct. Apart from a slightly higher age distribution in SR07, there was no significant difference in precinct make-up in terms of age.

As the majority of the attributes deal with voting history, the data was transformed into a binary classification of Voted: True/False for each election, and apriori (association rule) analysis was conducted using the 2017 Primary and 2017 General Elections as the outcomes for rule-making. This resulted in 202 association rules for Primary 2017 and 65 rules for General 2017.

# Rules for General elections: the rhs is fixed to voting on 11/7/2017
rules.G2017 <- apriori(voter.apriori[, 1:16],
                       parameter = list(supp = 0.05, conf = 0.5),
                       appearance = list(default = "lhs", rhs = "Voted_X11.7.2017"),
                       control = list(verbose = FALSE))

#Examine General Election Rules
rules.G2017
## set of 65 rules
##      lhs                   rhs                   support confidence     lift count
## [1]  {Voted_X11.3.2015,                                                           
##       Voted_X11.8.2016,                                                           
##       Voted_X8.15.2017} => {Voted_X11.7.2017} 0.05354303  0.8393352 4.008268   606
## [2]  {Voted_X11.3.2015,                                                           
##       Voted_X8.15.2017} => {Voted_X11.7.2017} 0.05531013  0.8357810 3.991295   626
## [3]  {Voted_X11.6.2012,                                                           
##       Voted_X11.5.2013,                                                           
##       Voted_X11.8.2016,                                                           
##       Voted_X8.15.2017} => {Voted_X11.7.2017} 0.05257113  0.8117326 3.876451   595
## [4]  {Voted_X11.6.2012,                                                           
##       Voted_X11.5.2013,                                                           
##       Voted_X8.15.2017} => {Voted_X11.7.2017} 0.05478000  0.8115183 3.875428   620
## [5]  {Voted_X11.5.2013,                                                           
##       Voted_X11.8.2016,                                                           
##       Voted_X8.15.2017} => {Voted_X11.7.2017} 0.05575190  0.8100128 3.868239   631
## [6]  {Voted_X11.2.2010,                                                           
##       Voted_X11.4.2014,                                                           
##       Voted_X11.8.2016,                                                           
##       Voted_X8.15.2017} => {Voted_X11.7.2017} 0.05036225  0.8096591 3.866549   570
## [7]  {Voted_X11.5.2013,                                                           
##       Voted_X8.15.2017} => {Voted_X11.7.2017} 0.05813748  0.8073620 3.855579   658
## [8]  {Voted_X11.2.2010,                                                           
##       Voted_X11.6.2012,                                                           
##       Voted_X11.4.2014,                                                           
##       Voted_X8.15.2017} => {Voted_X11.7.2017} 0.05045061  0.8064972 3.851449   571
## [9]  {Voted_X11.2.2010,                                                           
##       Voted_X11.4.2014,                                                           
##       Voted_X8.15.2017} => {Voted_X11.7.2017} 0.05195264  0.8054795 3.846589   588
## [10] {Voted_X11.6.2012,                                                           
##       Voted_X11.4.2014,                                                           
##       Voted_X11.8.2016,                                                           
##       Voted_X8.15.2017} => {Voted_X11.7.2017} 0.06370383  0.7949283 3.796202   721
##      lhs                   rhs                   support confidence     lift count
## [1]  {Voted_X8.15.2017} => {Voted_X11.7.2017} 0.11168051  0.7247706 3.461162  1264
## [2]  {Voted_X11.8.2016,                                                           
##       Voted_X8.15.2017} => {Voted_X11.7.2017} 0.10293338  0.7410941 3.539115  1165
## [3]  {Voted_X11.5.2013,                                                           
##       Voted_X11.8.2016} => {Voted_X11.7.2017} 0.09135890  0.5146839 2.457887  1034
## [4]  {Voted_X11.6.2012,                                                           
##       Voted_X11.5.2013} => {Voted_X11.7.2017} 0.08976851  0.5029703 2.401948  1016
## [5]  {Voted_X11.3.2015} => {Voted_X11.7.2017} 0.08711787  0.5615034 2.681475   986
## [6]  {Voted_X11.6.2012,                                                           
##       Voted_X11.5.2013,                                                           
##       Voted_X11.8.2016} => {Voted_X11.7.2017} 0.08623432  0.5219251 2.492468   976
## [7]  {Voted_X11.6.2012,                                                           
##       Voted_X8.15.2017} => {Voted_X11.7.2017} 0.08526241  0.7533177 3.597489   965
## [8]  {Voted_X11.3.2015,                                                           
##       Voted_X11.8.2016} => {Voted_X11.7.2017} 0.08376038  0.5690276 2.717407   948
## [9]  {Voted_X11.6.2012,                                                           
##       Voted_X11.8.2016,                                                           
##       Voted_X8.15.2017} => {Voted_X11.7.2017} 0.08172822  0.7594417 3.626735   925
## [10] {Voted_X11.5.2013,                                                           
##       Voted_X11.4.2014} => {Voted_X11.7.2017} 0.07819403  0.5473098 2.613693   885

The rules were reviewed for commonalities in order to identify which elections in the voting history were most relevant to later outcomes.
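
For readers unfamiliar with the support, confidence, and lift columns in the output above, the following is a minimal base R sketch of how those three measures are defined for a single rule. The function name `rule_metrics` and the toy vectors are hypothetical and are not part of the original analysis, which used `apriori`.

```r
# Support, confidence and lift for a single rule {lhs} => {rhs},
# computed from two logical vectors with one entry per voter.
rule_metrics <- function(lhs, rhs) {
  n <- length(lhs)
  both <- sum(lhs & rhs)               # voters satisfying both sides
  support <- both / n                  # share of all voters
  confidence <- both / sum(lhs)        # share of lhs voters who also match rhs
  lift <- confidence / (sum(rhs) / n)  # confidence relative to the base rate
  c(support = support, confidence = confidence, lift = lift, count = both)
}

# Toy data (hypothetical): 10 voters, TRUE = voted
primary_2017 <- c(TRUE, TRUE, TRUE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, TRUE)
general_2017 <- c(TRUE, TRUE, FALSE, TRUE, FALSE, FALSE, FALSE, FALSE, FALSE, TRUE)

m <- rule_metrics(primary_2017, general_2017)
print(round(m, 3))
```

Because lift is confidence divided by the base rate of the right-hand side, the printed rule {Voted_X8.15.2017} => {Voted_X11.7.2017} (confidence 0.7247706, lift 3.461162) implies that roughly 21% of the voters in the data voted on 11/7/2017.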

Most Common Elections from Apriori Rules
8/15/2017
11/8/2016
11/3/2015
11/4/2014
11/5/2013
11/6/2012
11/2/2010
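
One way to arrive at a list like this is to count how often each election appears on the left-hand side of the rules. The sketch below does so in base R for only the five highest-confidence rules printed earlier; the original review may have been done differently (for example, by inspection), so treat this as an illustration rather than the method used.

```r
# Antecedents (lhs) of the five highest-confidence General 2017 rules
lhs_sets <- list(
  c("Voted_X11.3.2015", "Voted_X11.8.2016", "Voted_X8.15.2017"),
  c("Voted_X11.3.2015", "Voted_X8.15.2017"),
  c("Voted_X11.6.2012", "Voted_X11.5.2013", "Voted_X11.8.2016", "Voted_X8.15.2017"),
  c("Voted_X11.6.2012", "Voted_X11.5.2013", "Voted_X8.15.2017"),
  c("Voted_X11.5.2013", "Voted_X11.8.2016", "Voted_X8.15.2017")
)

# Count how many rules mention each election, most frequent first
election_counts <- sort(table(unlist(lhs_sets)), decreasing = TRUE)
print(election_counts)
```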

Data Analysis

K-means Clustering

The first attempt (Option 1) looked for groupings of voters based on their voting history and age. The within-cluster sum of squares was examined across candidate values of k to determine the optimal k for clustering.
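
The within-sum-of-squares (“elbow”) approach can be sketched as follows using base R's `kmeans`. The toy data, the range of k, and the `nstart` value are assumptions made for illustration; the original analysis ran on the voter data.

```r
set.seed(42)  # reproducible toy data and k-means starts

# Toy data (hypothetical): three well-separated groups in two dimensions
toy <- rbind(
  matrix(rnorm(100, mean = 0, sd = 0.5), ncol = 2),
  matrix(rnorm(100, mean = 5, sd = 0.5), ncol = 2),
  matrix(rnorm(100, mean = 10, sd = 0.5), ncol = 2)
)

# Total within-cluster sum of squares for k = 1..6
k_values <- 1:6
wss <- sapply(k_values, function(k) {
  kmeans(toy, centers = k, nstart = 10)$tot.withinss
})

# The "elbow" is the k beyond which WSS stops dropping substantially
print(data.frame(k = k_values, wss = round(wss, 1)))
```

On data like this the sum of squares falls sharply up to k = 3 and flattens afterwards; real voter data rarely shows so clean a break, which is why choosing k involves some judgment.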

It was determined to use k=3.

Variable           Cluster 1  Cluster 2  Cluster 3
clu.voters1.size        1983       4064       5271
Age                 66.21432   27.92618   42.89148
Voted_X6.22.2010   0.2072617  0.0164862  0.0939101
Voted_X11.2.2010   0.4911750  0.0725886  0.3204326
Voted_X9.13.2011   0.1043873  0.0083661  0.0861317
Voted_X11.8.2011   0.1820474  0.0191929  0.1646746
Voted_X6.26.2012   0.2702975  0.0295276  0.1185733
Voted_X11.6.2012   0.6807867  0.3228346  0.6619237
Voted_X8.13.2013   0.1412002  0.0145177  0.0785430
Voted_X11.5.2013   0.3706505  0.0494587  0.2375261
Voted_X6.24.2014   0.1018659  0.0073819  0.0292165
Voted_X11.4.2014   0.5385779  0.1026083  0.3614115
Voted_X8.11.2015   0.1674231  0.0152559  0.0840448
Voted_X11.3.2015   0.2955119  0.0509350  0.1826978
Voted_X6.28.2016   0.2455875  0.0406004  0.0855625
Voted_X11.8.2016   0.8023197  0.5720965  0.7981408
Voted_X8.15.2017   0.3469491  0.0760335  0.1417188
Voted_X11.7.2017   0.4412506  0.1013780  0.2054639
Voted_X6.26.2018   0.4977307  0.1023622  0.2054639
Voted_X11.6.2018   0.8340898  0.5243602  0.6808955
Voted_X11.5.2019   0.4583964  0.1395177  0.2822994

This clustering was inconclusive, as the cluster centroids weren’t clearly differentiated. Clustering was therefore attempted again using only the elections identified by the apriori analysis discussed previously.

It was determined to use k = 4.

Variable           Cluster 1  Cluster 2  Cluster 3  Cluster 4
clu.voters2.size        2751       2998       2661       2908
Age                 1.695061   1.515949   1.563180   1.605510
Voted_X11.2.2010   0.7728099  0.0490327  0.0477264  0.1918845
Voted_X11.6.2012   0.9701927  0.1914610  0.0000000  1.0000000
Voted_X11.5.2013   0.6586696  0.0183456  0.0289365  0.0839065
Voted_X11.4.2014   0.9160305  0.0456971  0.0950770  0.1650619
Voted_X11.3.2015   0.5038168  0.0136758  0.0571214  0.0608666
Voted_X11.8.2016   0.9291167  0.0000000  1.0000000  0.9993122
Voted_X8.15.2017   0.3678662  0.0436958  0.1180008  0.0986933

This resulted in more defined clusters: one that clearly represented likely voters (the first, with 2,751 voters and high participation in nearly every election) and one that represented rare voters. The other two clusters didn’t clearly favor one or the other and weren’t readily distinguishable from each other. It was determined to move forward with only the “Likely Voter” cluster.
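
Picking out a “Likely Voter” cluster from k-means output can be sketched as below. The rule used here (take the cluster whose centroid has the highest mean participation) is an assumption for illustration; the original selection appears to have been made by reading the centroid table. The toy data and all variable names are hypothetical.

```r
set.seed(1)

# Toy voting history (hypothetical): 1 = voted, 0 = did not vote.
# 40 habitual voters followed by 60 rare voters, across 4 past elections.
habitual <- matrix(rbinom(40 * 4, 1, 0.9), ncol = 4)
rare <- matrix(rbinom(60 * 4, 1, 0.05), ncol = 4)
history <- rbind(habitual, rare)
colnames(history) <- paste0("election_", 1:4)

fit <- kmeans(history, centers = 2, nstart = 10)

# Each centroid row holds per-election participation rates for one cluster
print(round(fit$centers, 2))

# Treat the cluster with the highest mean participation as "likely voters"
likely_cluster <- which.max(rowMeans(fit$centers))
likely_rows <- which(fit$cluster == likely_cluster)
print(length(likely_rows))
```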

This “Likely Voters” Cluster was plotted on a map of Saratoga Springs to see if there were any identifiable patterns.

This plot was then compared with a plot of Actual Voters that was obtained from the data set.

Visually, the two plots show significant similarities, but differences are also apparent: the Actual Voter plot shows a different distribution as well as more overall voters in some areas of the city.

#Number of Likely Voters
nrow(likely.voters)
## [1] 2751
#Number of Actual Voters
nrow(actual.voters)
## [1] 2964

Judging by these counts alone, clustering appears to do a fairly accurate job of identifying likely voters. However, looking at the number of voters that appear in both the Likely Voter subset and the Actual Voter subset is illuminating.

correct.kmeans <- semi_join(actual.voters, likely.voters, by = "Voter.ID")
nrow(correct.kmeans)
## [1] 1437

Only 1,437 voters appear in both sets. That is just 48% of the 2,964 actual voters, and about 52% of the 2,751 voters in the Likely Voter cluster.
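
The overlap can be expressed with the usual recall and precision measures. The sketch below uses only the three counts reported above; the variable names are illustrative.

```r
# Counts reported in the output above
n_likely <- 2751   # voters in the "Likely Voters" cluster
n_actual <- 2964   # voters who actually voted
n_both <- 1437     # voters present in both sets

recall <- n_both / n_actual      # share of actual voters that were flagged
precision <- n_both / n_likely   # share of flagged voters who actually voted
f1 <- 2 * precision * recall / (precision + recall)

print(round(c(recall = recall, precision = precision, f1 = f1), 3))
```

Recall is the figure that matters most for a mailer: it says how many real voters a list built from the cluster would have reached.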


In the plot of Actual Voters, the areas of higher voter density seemed to lie close to the homes of candidates. A variable was therefore added to the data giving the distance (m) between each voter’s home and that of the nearest candidate.

Revising the Clustering to Include Distance from Candidates

To determine the influence that distance from a candidate might have on clustering, the distance (m) between each voter and the closest candidate was calculated and added to the data. This number was then log10 transformed to standardize its effect on clustering. This data set was then analyzed to determine the optimal k value.
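
A nearest-candidate distance of this kind can be sketched in base R as follows. The source does not say which distance formula was used, so the haversine formula, the Earth radius, the function name `haversine_m`, and all coordinates below are assumptions for illustration only.

```r
# Great-circle distance in metres between points given in decimal degrees.
# ASSUMPTION: haversine formula with a mean Earth radius of 6,371,000 m.
haversine_m <- function(lat1, lon1, lat2, lon2, radius = 6371000) {
  to_rad <- pi / 180
  dlat <- (lat2 - lat1) * to_rad
  dlon <- (lon2 - lon1) * to_rad
  a <- sin(dlat / 2)^2 +
    cos(lat1 * to_rad) * cos(lat2 * to_rad) * sin(dlon / 2)^2
  2 * radius * asin(pmin(1, sqrt(a)))
}

# Hypothetical coordinates, not taken from the voter file
voters <- data.frame(lat = c(40.350, 40.360, 40.340),
                     lon = c(-111.900, -111.910, -111.920))
candidates <- data.frame(lat = c(40.351, 40.341),
                         lon = c(-111.901, -111.921))

# Distance from each voter to the nearest candidate, then log10
voters$min.c.dist <- sapply(seq_len(nrow(voters)), function(i) {
  d <- haversine_m(voters$lat[i], voters$lon[i], candidates$lat, candidates$lon)
  log10(min(d))
})
print(voters)
```

Note that a distance of exactly zero (a candidate's own record) would give `-Inf` after the log transform, so such rows would need separate handling.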

It was determined to use k = 4.

Variable           Cluster 1  Cluster 2  Cluster 3  Cluster 4
clu.voters3.size        2193       2250       2484       2413
Age                 1.561480   1.693608   1.517365   1.606829
Voted_X11.2.2010   0.0469676  0.7711111  0.0438808  0.1989225
Voted_X11.6.2012   0.0000000  0.9697778  0.1916264  1.0000000
Voted_X11.5.2013   0.0287278  0.6608889  0.0181159  0.0841276
Voted_X11.4.2014   0.0939352  0.9133333  0.0454911  0.1595524
Voted_X11.3.2015   0.0565435  0.5057778  0.0136876  0.0646498
Voted_X11.8.2016   1.0000000  0.9315556  0.0000000  0.9975135
Voted_X8.15.2017   0.1158231  0.3662222  0.0390499  0.1002901
min.c.dist          2.955462   2.933730   2.956943   2.940828
nrow(actual.voters)
## [1] 2964
correct.kmeans.dist <- semi_join(actual.voters, likely.voters.dist, by = "Voter.ID")
nrow(correct.kmeans.dist)
## [1] 1167

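Including the distance from candidates actually decreases the overlap: 1,167 voters are correctly identified (about 39% of the 2,964 actual voters), down from 1,437. Note, however, that this clustering covers fewer records (the four cluster sizes sum to 9,340, versus 11,318 in the previous run), so the two results are not strictly comparable.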

Conclusion

While the model has some value as a means of eliminating likely non-voters, it is not yet robust enough to correctly identify which voters are likely to participate in municipal elections.